{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "35e166bf",
   "metadata": {
    "origin_pos": 0
   },
   "source": [
    "# Attention Mechanisms and Transformers\n",
    ":label:`chap_attention-and-transformers`\n",
    "\n",
    "\n",
    "The earliest years of the deep learning boom were driven primarily\n",
    "by results produced using the multilayer perceptron,\n",
    "convolutional network, and recurrent network architectures.\n",
    "Remarkably, the model architectures that underpinned\n",
    "many of deep learning's breakthroughs in the 2010s\n",
    "had changed remarkably little relative to their\n",
    "antecedents despite the lapse of nearly 30 years.\n",
    "While plenty of new methodological innovations\n",
    "made their way into most practitioner's toolkits---ReLU\n",
    "activations, residual layers, batch normalization, dropout,\n",
    "and adaptive learning rate schedules come to mind---the core\n",
    "underlying architectures were clearly recognizable as\n",
    "scaled-up implementations of classic ideas.\n",
    "Despite thousands of papers proposing alternative ideas,\n",
    "models resembling classical convolutional neural networks (:numref:`chap_cnn`)\n",
    "retained *state-of-the-art* status in computer vision\n",
    "and models resembling Sepp Hochreiter's original design\n",
    "for the LSTM recurrent neural network (:numref:`sec_lstm`),\n",
    "dominated most applications in natural language processing.\n",
    "Arguably, to that point, the rapid emergence of deep learning\n",
    "appeared to be primarily attributable to shifts\n",
    "in the available computational resources\n",
    "(thanks to innovations in parallel computing with GPUs)\n",
    "and the availability of massive data resources\n",
    "(thanks to cheap storage and Internet services).\n",
    "While these factors may indeed remain the primary drivers\n",
    "behind this technology's increasing power\n",
    "we are also witnessing, at long last,\n",
    "a sea change in the landscape of dominant architectures.\n",
    "\n",
    "At the present moment, the dominant models\n",
    "for nearly all natural language processing tasks\n",
    "are based on the Transformer architecture.\n",
    "Given any new task in natural language processing, the default first-pass approach\n",
    "is to grab a large Transformer-based pretrained model,\n",
    "(e.g., BERT :cite:`Devlin.Chang.Lee.ea.2018`, ELECTRA :cite:`clark2019electra`, RoBERTa :cite:`Liu.Ott.Goyal.ea.2019`, or Longformer :cite:`beltagy2020longformer`)\n",
    "adapting the output layers as necessary,\n",
    "and fine-tuning the model on the available\n",
    "data for the downstream task.\n",
    "If you have been paying attention to the last few years\n",
    "of breathless news coverage centered on OpenAI's\n",
    "large language models, then you have been tracking a conversation\n",
    "centered on the GPT-2 and GPT-3 Transformer-based models :cite:`Radford.Wu.Child.ea.2019,brown2020language`.\n",
    "Meanwhile, the vision Transformer has emerged\n",
    "as a default model for diverse vision tasks,\n",
    "including image recognition, object detection,\n",
    "semantic segmentation, and superresolution :cite:`Dosovitskiy.Beyer.Kolesnikov.ea.2021,liu2021swin`.\n",
    "Transformers also showed up as competitive methods\n",
    "for speech recognition :cite:`gulati2020conformer`,\n",
    "reinforcement learning :cite:`chen2021decision`,\n",
    "and graph neural networks :cite:`dwivedi2020generalization`.\n",
    "\n",
    "The core idea behind the Transformer model is the *attention mechanism*,\n",
    "an innovation that was originally envisioned as an enhancement\n",
    "for encoder--decoder RNNs applied to sequence-to-sequence applications,\n",
    "such as machine translations :cite:`Bahdanau.Cho.Bengio.2014`.\n",
    "You might recall that in the first sequence-to-sequence models\n",
    "for machine translation :cite:`Sutskever.Vinyals.Le.2014`,\n",
    "the entire input was compressed by the encoder\n",
    "into a single fixed-length vector to be fed into the decoder.\n",
    "The intuition behind attention is that rather than compressing the input,\n",
    "it might be better for the decoder to revisit the input sequence at every step.\n",
    "Moreover, rather than always seeing the same representation of the input,\n",
    "one might imagine that the decoder should selectively focus\n",
    "on particular parts of the input sequence at particular decoding steps.\n",
    "Bahdanau's attention mechanism provided a simple means\n",
    "by which the decoder could dynamically *attend* to different\n",
    "parts of the input at each decoding step.\n",
    "The high-level idea is that the encoder could produce a representation\n",
    "of length equal to the original input sequence.\n",
    "Then, at decoding time, the decoder can (via some control mechanism)\n",
    "receive as input a context vector consisting of a weighted sum\n",
    "of the representations on the input at each time step.\n",
    "Intuitively, the weights determine the extent\n",
    "to which each step's context \"focuses\" on each input token,\n",
    "and the key is to make this process\n",
    "for assigning the weights differentiable\n",
    "so that it can be learned along with\n",
    "all of the other neural network parameters.\n",
    "\n",
    "Initially, the idea was a remarkably successful\n",
    "enhancement to the recurrent neural networks\n",
    "that already dominated machine translation applications.\n",
    "The models performed better than the original\n",
    "encoder--decoder sequence-to-sequence architectures.\n",
    "Furthermore, researchers noted that some nice qualitative insights\n",
    "sometimes emerged from inspecting the pattern of attention weights.\n",
    "In translation tasks, attention models\n",
    "often assigned high attention weights to cross-lingual synonyms\n",
    "when generating the corresponding words in the target language.\n",
    "For example, when translating the sentence \"my feet hurt\"\n",
    "to \"j'ai mal au pieds\", the neural network might assign\n",
    "high attention weights to the representation of \"feet\"\n",
    "when generating the corresponding French word \"pieds\".\n",
    "These insights spurred claims that attention models confer \"interpretability\"\n",
    "although what precisely the attention weights mean---i.e.,\n",
    "how, if at all, they should be *interpreted* remains a hazy research topic.\n",
    "\n",
    "However, attention mechanisms soon emerged as more significant concerns,\n",
    "beyond their usefulness as an enhancement for encoder--decoder recurrent neural networks\n",
    "and their putative usefulness for picking out salient inputs.\n",
    ":citet:`Vaswani.Shazeer.Parmar.ea.2017` proposed\n",
    "the Transformer architecture for machine translation,\n",
    "dispensing with recurrent connections altogether,\n",
    "and instead relying on cleverly arranged attention mechanisms\n",
    "to capture all relationships among input and output tokens.\n",
    "The architecture performed remarkably well,\n",
    "and by 2018 the Transformer began showing up\n",
    "in the majority of state-of-the-art natural language processing systems.\n",
    "Moreover, at the same time, the dominant practice in natural language processing\n",
    "became to pretrain large-scale models\n",
    "on enormous generic background corpora\n",
    "to optimize some self-supervised pretraining objective,\n",
    "and then to fine-tune these models\n",
    "using the available downstream data.\n",
    "The gap between Transformers and traditional architectures\n",
    "grew especially wide when applied in this pretraining paradigm,\n",
    "and thus the ascendance of Transformers coincided\n",
    "with the ascendence of such large-scale pretrained models,\n",
    "now sometimes called *foundation models* :cite:`bommasani2021opportunities`.\n",
    "\n",
    "\n",
    "In this chapter, we introduce attention models,\n",
    "starting with the most basic intuitions\n",
    "and the simplest instantiations of the idea.\n",
    "We then work our way up to the Transformer architecture,\n",
    "the vision Transformer, and the landscape\n",
    "of modern Transformer-based pretrained models.\n",
    "\n",
    ":begin_tab:toc\n",
    " - [queries-keys-values](queries-keys-values.ipynb)\n",
    " - [attention-pooling](attention-pooling.ipynb)\n",
    " - [attention-scoring-functions](attention-scoring-functions.ipynb)\n",
    " - [bahdanau-attention](bahdanau-attention.ipynb)\n",
    " - [multihead-attention](multihead-attention.ipynb)\n",
    " - [self-attention-and-positional-encoding](self-attention-and-positional-encoding.ipynb)\n",
    " - [transformer](transformer.ipynb)\n",
    " - [vision-transformer](vision-transformer.ipynb)\n",
    " - [large-pretraining-transformers](large-pretraining-transformers.ipynb)\n",
    ":end_tab:\n"
   ]
  }
 ],
 "metadata": {
  "language_info": {
   "name": "python"
  },
  "required_libs": []
 },
 "nbformat": 4,
 "nbformat_minor": 5
}